Alberta In-situ Oilsands Production Analysis using Python - Part II

Preparing a tidy dataset for data analysis and visualization

python

pandas

data analysis

data visualization

Author

Farshad Tabasinejad

Published

March 4, 2023

Disclaimer

This blog post is for educational purposes only. Any commercial use of the information provided in this blog post is prohibited. For more information about the AER copyright and permission to reproduce, please visit AER Copyright and Disclaimer. The author is not responsible for any damage or loss caused by the use of the information provided in this blog post.

Introduction

This blog post focuses on analyzing the bitumen production data from in-situ oilsands projects in Alberta. Alberta Energy Regulator (AER) lists the monthly injection and production data for all in-situ oilsands projects in Alberta on its website Alberta Energy Regulator. The data is available in several spreadsheets and can be downloaded from ST53: Alberta In Situ Oil Sands Production Summary. The annual data are reported in separate files in .xls format. Each file has 6 different sheets with the following information: bitumen production, water usage, steam injection, well count, steam-to-oil ratio (SOR), and water-to-oil ratio (WSR).

In the previous post Alberta In-situ Oilsands Production Analysis using Python - Part I, I have shown how to download the data using python. In this post we create a tidy bitumen dataset for data analysis and visualization.

Loading the Data

All files are stored in the current local directory. We use the glob module to list all .xls files in the directory and then use pandas to read the data into a DataFrame. The xlrd is used as an engine in pandas to read the .xls files. The numpy module is used to perform mathematical operations on the data. The warnings module is used to suppress the warnings.

import glob
import pandas as pd
import numpy as np
import xlrd
import warnings
warnings.filterwarnings('ignore')

The glob module is used in this code to create a list of file paths for all the .xls files in the current directory.

xls_files = glob.glob('*.xls')
xls_files

['ST53_2010-12.xls',
 'ST53_2011-12.xls',
 'ST53_2012-12.xls',
 'ST53_2013-12.xls',
 'ST53_2014-12.xls',
 'ST53_2016-12.xls',
 'ST53_2017-12.xls',
 'ST53_2018-12.xls',
 'ST53_2019-12.xls',
 'ST53_2020-12.xls',
 'ST53_2021-12.xls',
 'ST53_2022-12.xls']

Analyzing the 2010 bitumen production data

This section describes the steps for creating a tidy dataset for bitumen production using the 2010 data.

Reading the first file

To start, we read the data from the ST53_2010.xls file using the read_excel function from the pandas library. We set the sheet_name argument to None to read all the sheets in the file. We also use the skiprows argument to skip the first three rows of each sheet. Finally, we use the keys() method to print the names of the sheets in the file.

sheets = pd.read_excel(xls_files[0], sheet_name = None, skiprows = 3)
print(f'sheet names: {sheets.keys()}')

sheet names: dict_keys(['BITUMEN', 'WATER', 'STEAM', 'WELLS', 'SOR', 'WSR'])

Reading the BITUMEN sheet

# read the BITUMEN sheet
bitumen_2010 = sheets['BITUMEN']

The info method is used to print the column names and data types of each column in the DataFrame.

bitumen_2010.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 177 entries, 0 to 176
Data columns (total 18 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Operator         172 non-null    object 
 1   Scheme Name      26 non-null     object 
 2   Area             163 non-null    object 
 3   Approval Number  163 non-null    object 
 4   Recovery Method  164 non-null    object 
 5   Jan              166 non-null    float64
 6   Feb              166 non-null    float64
 7   Mar              166 non-null    float64
 8   Apr              165 non-null    float64
 9   May              165 non-null    float64
 10  Jun              162 non-null    float64
 11  Jul              162 non-null    float64
 12  Aug              162 non-null    float64
 13  Sep              161 non-null    float64
 14  Oct              159 non-null    float64
 15  Nov              159 non-null    float64
 16  Dec              156 non-null    float64
 17  Monthly Average  153 non-null    float64
dtypes: float64(13), object(5)
memory usage: 25.0+ KB

The head method is used to display the first 5 rows of the DataFrame.

bitumen_2010.head()

	Operator	Scheme Name	Area	Approval Number	Recovery Method	Jan	Feb	Mar	Apr	May	Jun	Jul	Aug	Sep	Oct	Nov	Dec	Monthly Average
0	Canadian Natural Resources Limited	NaN	Cold Lake	4746P	Commercial	587.48	596.72	637.59	678.74	705.61	718.99	697.25	696.88	763.21	702.21	696.32	674.99	679.67
1	Canadian Natural Resources Limited	NaN	Cold Lake	6726I	Commercial	171.54	193.66	187.71	177.54	176.90	174.50	160.53	155.97	168.22	168.27	139.73	143.48	168.17
2	Baytex Energy Ltd. (¹)	Cliffdale Pilot	Peace River Area 2	11034E	Commercial-CSS	0.00	2.55	6.15	2.25	4.16	10.59	5.83	3.27	9.18	20.09	15.42	13.15	8.42
3	North Peace Energy Corp.	Red Earth	Peace River Area 2	11209A	Commercial-CSS	10.30	4.70	3.11	2.21	1.63	0.51	0.00	0.00	0.40	0.00	0.00	0.00	3.27
4	Penn West Petroleum Ltd.	Seal	Peace River Area 2	11377A	Commercial-CSS	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	NaN

The tail method is used to print the last 20 rows of the DataFrame.

bitumen_2010.tail(20)

	Operator	Scheme Name	Area	Approval Number	Recovery Method	Jan	Feb	Mar	Apr	May	Jun	Jul	Aug	Sep	Oct	Nov	Dec	Monthly Average
157	Bronco Energy Ltd.	NaN	Athabasca	9774E	Primary	28.65	28.58	27.94	22.84	91.39	80.23	77.59	83.21	64.28	74.45	69.66	63.98	59.40
158	Devon Canada Corporation	NaN	Cold Lake	9821	Primary	4.00	3.41	3.42	0.00	1.87	0.00	1.74	1.04	4.18	4.09	1.94	2.76	2.85
159	Frog Lake Energy Resources Corp.	NaN	Cold Lake	9827B	Primary	11.42	5.22	5.11	8.76	12.08	16.21	13.78	11.12	17.36	18.75	19.41	22.18	13.45
160	Husky Oil Operations Limited	NaN	Cold Lake	9873	Primary	12.32	12.80	12.66	11.27	12.93	11.56	11.63	11.27	10.62	10.79	11.50	10.50	11.65
161	Canadian Natural Resources Limited	NaN	Athabasca	9884D	Primary	65.64	68.55	69.79	64.72	68.09	64.18	63.37	66.10	64.25	62.55	59.87	55.29	64.37
162	Koch Exploration Canada G/P Ltd. (¹)	NaN	Cold Lake	9908D	Primary	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	NaN
163	NaN	NaN	NaN	NaN	Total	109423.83	116311.68	120054.09	116601.43	121671.30	122439.61	122234.75	121215.19	108260.73	118808.23	128196.96	129913.57	119594.28
164	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
165	Conventional Bitumen Recovery	NaN	NaN	NaN	NaN	537.49	544.67	560.36	541.87	560.71	524.96	482.84	555.27	555.21	545.33	601.13	593.86	550.31
166	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
167	Total	NaN	NaN	NaN	NaN	109961.32	116856.35	120614.45	117143.30	122232.01	122964.57	122717.59	121770.46	108815.94	119353.56	128798.09	130507.43	120144.59
168	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
169	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
170	Notes:	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
171	- Monthly production totals may be revised due...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
172	- Total may not add due to rounding	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
173	- Conventional Bitumen production consists of ...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
174	- Monthly average does not include months ther...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
175	¹ Amended this month	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
176	¹ Amended this month	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

Since the number of producers in the file can vary, we cannot rely on a fixed index number to subset the dataset. However, the last row of the dataset always contains a summary of the total production, which is identified by the cell value of Total in the Recovery Method column. Therefore, we can use the index number of the row with Total in the Recovery Method column to subset the dataset and exclude the summary row and any other rows with additional information.

# get the index of the last row using np.where
last_row_idx = np.where(bitumen_2010[["Recovery Method"]] == "Total ")[0][0]
bitumen_2010 = bitumen_2010.iloc[:last_row_idx, :]

In addition, we can remove the Monthly Average column from the dataset since we will be aggregating the data in our analysis and therefore do not need the monthly averages.

bitumen_2010 = bitumen_2010.drop(columns = ['Monthly Average'])

Cleaning the Operator column

Let’s take a look at the Operator column.

bitumen_2010['Operator'].unique()

array(['Canadian Natural Resources Limited', 'Baytex Energy Ltd. (¹)',
       'North Peace Energy Corp.', 'Penn West Petroleum Ltd.',
       'Baytex Energy Ltd. (¹²)', 'Shell Canada Limited',
       'Imperial Oil Resources (¹)',
       'Canadian Natural Resources Limited (¹)',
       'Devon NEC Corporation (¹)', 'Connacher Oil And Gas Limited',
       'Meg Energy Corp.', 'Statoil Canada Ltd. (¹)',
       'Cenovus Energy Inc.', 'Cenovus FCCL Ltd.', 'Suncor Energy Inc.',
       'Japan Canada Oil Sands Limited (¹)', 'Suncor Energy Inc. (¹)',
       'Total E&P Joslyn Ltd. (¹)',
       'ConocoPhillips Canada Resources Corp.',
       'ConocoPhillips Canada Limited', 'Nexen Inc.',
       'Husky Oil Operations Limited', 'Canadian Natural Resources',
       'Murphy Oil Company Ltd.', 'Cenovus Energy Inc. (¹)',
       'Prosper Petroleum Ltd.', 'Penn West Energy Trust',
       'Baytex Energy Ltd.', 'Shell Canada Energy',
       'Husky Oil Operations Ltd.', 'Action Energy Inc.',
       'Devon Canada Corporation', 'Bonavista Petroleum Ltd.',
       'Murphy Oil Canada', 'Watch Resources Ltd.',
       'Manitok Exploration Inc.', 'Koch Exploration Canada G/P Ltd.',
       'Husky Oil Operations Limited (et Al)', 'Twin Butte Energy Ltd.',
       'Sunshine Oilsands Ltd.', 'Solara Exploration Ltd.',
       'Canadian Natural Resources Limited (¹²)',
       'Bellatrix Exploration Ltd. (¹²)', 'Blackpearl Resources Inc. (¹)',
       'Crispin Energy Inc.', 'Husky Oil Operations Limited (¹)',
       'Buffalo Resources Corp.', 'ExxonMobil Canada Ltd.',
       'Harvest Operations Corp.', 'Bronco Energy Ltd.',
       'Frog Lake Energy Resources Corp.',
       'Koch Exploration Canada G/P Ltd. (¹)'], dtype=object)

The Operator column contains several names that are followed by (subscripts). These additional pieces of information create inconsistencies in the Operator column and add unnecessary complexity to the dataset. Since they are not needed for our analysis, we can remove them using the replace method.

bitumen_2010['Operator'] = bitumen_2010['Operator'].str.replace(r'\(.*\)', '').str.strip()
bitumen_2010['Operator'].unique()

array(['Canadian Natural Resources Limited', 'Baytex Energy Ltd.',
       'North Peace Energy Corp.', 'Penn West Petroleum Ltd.',
       'Shell Canada Limited', 'Imperial Oil Resources',
       'Devon NEC Corporation', 'Connacher Oil And Gas Limited',
       'Meg Energy Corp.', 'Statoil Canada Ltd.', 'Cenovus Energy Inc.',
       'Cenovus FCCL Ltd.', 'Suncor Energy Inc.',
       'Japan Canada Oil Sands Limited', 'Total E&P Joslyn Ltd.',
       'ConocoPhillips Canada Resources Corp.',
       'ConocoPhillips Canada Limited', 'Nexen Inc.',
       'Husky Oil Operations Limited', 'Canadian Natural Resources',
       'Murphy Oil Company Ltd.', 'Prosper Petroleum Ltd.',
       'Penn West Energy Trust', 'Shell Canada Energy',
       'Husky Oil Operations Ltd.', 'Action Energy Inc.',
       'Devon Canada Corporation', 'Bonavista Petroleum Ltd.',
       'Murphy Oil Canada', 'Watch Resources Ltd.',
       'Manitok Exploration Inc.', 'Koch Exploration Canada G/P Ltd.',
       'Twin Butte Energy Ltd.', 'Sunshine Oilsands Ltd.',
       'Solara Exploration Ltd.', 'Bellatrix Exploration Ltd.',
       'Blackpearl Resources Inc.', 'Crispin Energy Inc.',
       'Buffalo Resources Corp.', 'ExxonMobil Canada Ltd.',
       'Harvest Operations Corp.', 'Bronco Energy Ltd.',
       'Frog Lake Energy Resources Corp.'], dtype=object)

The operator names in the dataset may contain inconsistencies where some operators are listed with different names. To make the names consistent, we can create a dictionary with the names to be replaced and the new names. However, it’s important to note that the following dictionary may contain additional names that are not in the 2010 dataset. This dictionary was created using data from all the files between 2010 and 2022.

# create a dictionary to make Operator names consistent
operators_dict = {'Athabasca Oil Corporation': 'Athabasca Oil',
                  'Baytex Energy Ltd.': 'Baytex',
                  'Blackpearl Resources Inc.': 'BlackPearl',
                  'BlackPearl Resources Inc.': 'BlackPearl',
                  'Bonavista Petroleum Ltd.': 'Bonavista',
                  'Bonavista Energy Corporation': 'Bonavista',
                  'CNOOC Petroleum North America ULC': 'CNOOC',
                  'Canadian Natural Resources': 'CNRL',
                  'Canadian Natural Resources Limited': 'CNRL',
                  'Canadian Natural Upgrading Limited': 'CNRL',
                  'Cenovus Energy Inc.': 'Cenovus',
                  'Cenovus FCCL Ltd.': 'Cenovus',
                  'ConocoPhillips Canada Resources Corp.': 'ConocoPhillips',
                  'ConocoPhillips Canada Limited': 'ConocoPhillips',
                  'Devon Canada Corporation': 'Devon',
                  'Devon NEC Corporation': 'Devon',
                  'ExxonMobil Canada Ltd.': 'Imperial',
                  'Greenfire Hangingstone Operating Corporation': 'Greenfire',
                  'Greenfire Resources Operating Corporation': 'Greenfire',
                  'Husky Oil Operations Limited' : 'Husky',
                  'Husky Oil Operations Ltd.' : 'Husky',
                  'Imperial Oil Resources': 'Imperial',
                  'Imperial Oil Resources Limited': 'Imperial',
                  'Islander Oil & Gas Inc.': 'Islander',
                  'Koch Exploration Canada G/P Ltd.': 'Koch',
                  'Koch Oil Sands Operating ULC': 'Koch',
                  'MEG Energy Corp.': 'MEG',
                  'Meg Energy Corp.': 'MEG',
                  'Murphy Oil Canada': 'Murphy',
                  'Murphy Oil Company Ltd.': 'Murphy',
                  'Nexen Energy ULC': 'Nexen',
                  'Nexen Inc.': 'Nexen',
                  'Obsidian Energy Ltd.': 'Obsidian',
                  'OSUM Oil Sands Corp.': 'OSUM',
                  'Osum Production Corp.': 'OSUM',
                  'Pengrowth Corporation': 'Pengrowth',
                  'Pengrowth Energy Corporation': 'Pengrowth',
                  'Penn West Energy Trust': 'Penn West',
                  'Penn West Petroleum Ltd.': 'Penn West',
                  'Perpetual Energy Inc.': 'Perpetual',
                  'Perpetual Energy Operating Corp.': 'Perpetual',
                  'Perpetual Operating Corp.': 'Perpetual',
                  'PetroChina Canada Ltd.': 'PetroChina',
                  'Petrochina Canada Ltd.': 'PetroChina',
                  'Strathcona Resources Ltd.': 'Strathcona',
                  'Shell Canada Energy': 'Shell',
                  'Shell Canada Limited': 'Shell',
                  'Spur Petroleum Ltd.': 'Spur',
                  'Spur Resources Ltd.': 'Spur',
                  'Suncor Energy Inc.': 'Suncor',
                  'Woodcote Oil & Gas Inc.': 'Woodcote',
                  'Woodcote Oil Corp.': 'Woodcote'
                  }

# replace the names in the Operator column
bitumen_2010['Operator'] = bitumen_2010['Operator'].replace(operators_dict)
bitumen_2010['Operator'].unique()

array(['CNRL', 'Baytex', 'North Peace Energy Corp.', 'Penn West', 'Shell',
       'Imperial', 'Devon', 'Connacher Oil And Gas Limited', 'MEG',
       'Statoil Canada Ltd.', 'Cenovus', 'Suncor',
       'Japan Canada Oil Sands Limited', 'Total E&P Joslyn Ltd.',
       'ConocoPhillips', 'Nexen', 'Husky', 'Murphy',
       'Prosper Petroleum Ltd.', 'Action Energy Inc.', 'Bonavista',
       'Watch Resources Ltd.', 'Manitok Exploration Inc.', 'Koch',
       'Twin Butte Energy Ltd.', 'Sunshine Oilsands Ltd.',
       'Solara Exploration Ltd.', 'Bellatrix Exploration Ltd.',
       'BlackPearl', 'Crispin Energy Inc.', 'Buffalo Resources Corp.',
       'Harvest Operations Corp.', 'Bronco Energy Ltd.',
       'Frog Lake Energy Resources Corp.'], dtype=object)

It should be noted that there have been a number of acquisitions and mergers in the oil and gas industry in recent years, and this can lead to inconsistencies in the naming of operators across different years. However, for the sake of simplicity, we will ignore these changes and assume that operator names remain consistent across all years.

Simplifiying the Area column

To simplify the Area column, we can replace the values Peace River Area 1 and Peace River Area 2 with a single value Peace River. This is done to aggregate the production data for the Peace River area.

bitumen_2010['Area'] = bitumen_2010['Area'].replace({'Peace River Area 1': 'Peace River', 'Peace River Area 2': 'Peace River'}) 
bitumen_2010['Area'].unique()

array(['Cold Lake', 'Peace River', 'Athabasca, Cold Lake', 'Athabasca'],
      dtype=object)

There is only one row with an Area value of Athabasca, Cold Lake. To make it consistent with the other rows, we can change it to Cold Lake.

bitumen_2010['Area'] = bitumen_2010['Area'].replace({'Athabasca, Cold Lake': 'Cold Lake'})

We also add a Year column to the dataframe.

bitumen_2010['Year'] = int(xls_files[0].split('_')[-1].split('-')[0])
bitumen_2010['Year'].unique()

array([2010], dtype=int64)

Creating a single tidy dataset for bitumen production for all years

We can combine all the previous steps to create a function that generates a tidy dataset for the bitumen analysis. The function takes the following arguments:

df: a dataframe for a given year
operators_dict: a dictionary with the names to be replaced and the new names
xls_file: the name of the excel file for a given year

# define a function to create a tidy dataset for bitumen production
def create_tidy_bitumen(df, operators_dict, xls_file):
    last_row_idx = np.where(df[["Recovery Method"]] == "Total ")[0][0]
    df = df.iloc[:last_row_idx, :]
    df = (df
            .drop(columns = ['Monthly Average'])
            .dropna(subset = ['Operator'])
            .assign(Operator = lambda x: x.Operator.str.split('(').str[0].str.strip())
            .assign(Operator = lambda x: x.Operator.str.replace('  ', ' '))
            .replace({'Operator': operators_dict})
            .assign(Year = int(xls_file.split('_')[-1].split('-')[0]))
            .assign(Area = lambda x: x.Area.replace({'Peace River Area 1': 'Peace River', 'Peace River Area 2': 'Peace River'}))
    )
    return df

To generate a tidy dataset for bitumen production for all years, we can use the previously defined function to create tidy datasets for each year and then use a for loop to iterate through all the .xls files. For each year, we generate a tidy dataset using the function and append it to the bitumen dataframe using the df.append() method.

# create a tidy dataset for bitumen production for all years
bitumen = pd.DataFrame()
for xls_file in xls_files:
    df = pd.read_excel(xls_file, sheet_name = None, skiprows = 3)
    bitumen = bitumen.append(create_tidy_bitumen(df['BITUMEN'], operators_dict, xls_file))

Let’s check the Year column to make sure that we have data for all years.

bitumen['Year'].unique()

array([2010, 2011, 2012, 2013, 2014, 2016, 2017, 2018, 2019, 2020, 2021,
       2022], dtype=int64)

Saving the dataset

We save the dataset as a csv file in the current local folder for future use.

bitumen.to_csv('bitumen.csv', index = False)